Often times decision makers and technical advisors find themselves needing to collect data and plot it in the form of a map for a given country or region to share with other stakeholders. This is particularly true in the field of Public Health. Short deadlines, increasing loads of work, continuous changes in available data and lack of technical knowledge, can make finding time to plot this quantities somehow challenging. In this tutorial, we will work our ways to making a malaria prevalence map for Ghana using data from the Demographic and Health Surveys Program (DHS). We will access the data using the DHS r package rdhs and will select the malaria prevalence indicator for all surveys conducted in Ghana. We will create a time series faceted plot of this indicator using ggplot. Later, we will move onto making our prevalence map. We will plot the administrative areas of Ghana and discuss what a shapefile is. We will join our spatial data to the malaria prevalence per region in Ghana and produce our first prevalence map. Finally, we will make an interactive map and discuss a possible integration to Shiny dashboards or apps.
Malaria is an endemic disease that remains a real public health problem as it continues to be a major cause of morbidity and mortality worldwide. According to the World Health Organization (WHO), the number of malaria cases is estimated at more than 229 million and the main burden of disease (more than 95%) is in sub-Saharan Africa with around 409 thousands deaths each year.
Malaria is a mosquito-borne infectious disease caused by a parasite of the Plasmodium group. It is most commonly spread by an infected female Anopheles mosquito that while feeding spreads the parasite into the human or animal host. The parasite then mutates within the host causing symptoms such as fever, tiredness, vomiting, and headaches. In severe cases it can lead to death. An individual that has survived a first infection of malaria and who has a continuous exposure to malaria creates some resistance to new infections resulting in milder symptoms. Thus, the disease severely affects children.
The disease is widespread in the tropical and subtropical areas including Sub-Saharan Africa, Asia and Latin America.
Some definitions:
Ghana is a country along the Gulf of Guinea and the Atlantic Ocean, in the sub-region of West Africa. It neighbors the Ivory Coast in the west, Burkina Faso in the north and Togo in the east. The WHO reports that Ghana had almost 5 million malaria cases in 2019 and 11 thousand deaths for a population of 30 million people. Mortality rates have been in decline since the year 2000 and morbidity since 2014. Currently, Ghana is composed by 16 regions. For this tutorial we will use the old Ghana administrative division of 10 regions. We do this as all the available surveys that we will work with were conducted before this administrative change.
The Demographic Health and Survey Program (DHS) has provided technical assistance to more than 400 surveys in over 90 countries, providing a fantastic database to understand the health and population trends in developing countries. The surveys conducted by DHS covered topics such as malaria, nutrition, HIV/AIDS, family planning, fertility and maternal and child health. It is primarily funded by the U.S. Agency for International Development (USAID), but contributions from other donors and the participating countries also support surveys.
In this tutorial we will learn the following topics:
This tutorial is targeted to anyone who is already familiar with R and tidyverse (dplyr and ggplot), but is not an expert. This person, would like to represent their data in a form of a map and would, otherwise, spend a long time trying to do this on their own.
As we will use data from DHS, this tutorial is also ideal for people working in public health, such as decision makers, who would like to know how to access this data and visualize it for their work. In this tutorial we will look at malaria indicators from DHS in Ghana. However, it is important to note that the DHS Programme does not only collect data on malaria but has collected, analyzed, and disseminated data on population, health, HIV, and nutrition through more than 400 surveys in over 90 countries.
Familiarity with tidyverse, specifically with dplyr and ggplot.
We will work on Rstudio and the following libraries will have to be already installed and loaded in their environments:
Participants will need to download the ./GHA_shapefile folder provided in this Git repository and place it in their working directory.
Internet connection is required to access the DHS data.
Total duration ~3hs. The times per section are estimated as maximum times.
Tatiana works at the Swiss Tropical and Public Health Institute (Swiss TPH) in Basel, Switzerland. She is part of the Analytics and Intervention Modelling group lead by Emilie Pothin. There, she does data analysis and mathematical modelling supporting countries in Africa to make evidence-based decisions to reduce their malaria burden. Currently, she is responsible for the support to Mozambique and Ghana.
Tatiana has a PhD degree in Physics from the Federal University of Ceara (Brazil) and a M.Sc-Bachelors degree in Physics from the University of Buenos Aires (Argentina). During her doctoral studies she got interested in knowing what visual strategies would people use when looking for Wally and wrote a whole thesis about it. Afterwards, she participated in a project-based data science course (S2DS), working for the Food Standard Agency in the UK. During this experience, Tatiana realized that she loves working in a team and that she is passionate about data visualization. She knew she wanted to be in a position where she could turn data into value and help in the decision making process.
She works in R EVERYDAY.Hello! Welcome to this tutorial on malaria, DHS and maps!
Let’s first set our environment. We will use these libraries. If you haven’t done it yet, go ahead and load them.
# to handle data
library(tidyverse)
# to extract public data
library(rdhs)
# to plot maps
library(rgdal)
library(raster)
library(tmap)Also, make sure to have the folder “./GHA_shapefile” in your working directory.
Now that we are all set, let’s load some data! We are gonna load data that comes from the DHS Program. The data is currently available through their r package and it is the one that’s shown in the DHS Stat Compiler.
Okay, let’s dive in!
First, let’s check what are the survey characteristics available:
## [1] 93
Remember we will focus in Malaria indicators, but there is a lot of other things to look at.
Now we want to know what specific indicators are available for a given type of survey. In our case, for the malaria surveys.
# load all tags for possible indicators
indicators <- dhs_indicators()
indicators[order(indicators$IndicatorId),][1:5,c("IndicatorId", "Definition")]Explore the indicators database, what indicators does it have? Pay special attention to the field “Level1”.
Let’s filter those indicators that are malaria related,
Now we want to find the malaria indicator for malaria prevalence confirmed by RDTs. What do you see? What happens if you write them in all capital letters? and if you look only for RDTs?
Fantastic! So we now that the indicator that we are interested in has the ID: “ML_PMAL_C_RDT”.
Now we are gonna download the DHS data for the ML_PMAL_C_RDT indicator for Ghana. Countries names can be accessed though a country ID, which we can see by looking into the dhs_countries function.
For Ghana, the country ID is GH. To extract the data from DHS into a dataframe we use the function dhs_data with the fields
countryIds = "GH"indicatorIds = "ML_PMAL_C_RDT"breakdown = "subnational" , so we can have prevalence values per region (the default is to give estimates at the national level)pr_gha <- dhs_data(countryIds = c("GH"), indicatorIds = "ML_PMAL_C_RDT", breakdown = "subnational")
pr_gha## [1] "Western" "Central"
## [3] "Greater Accra" "Volta"
## [5] "Eastern" "Ashanti"
## [7] "Brong-Ahafo" "Northern, Upper West, Upper East"
## [9] "..Northern" "..Upper West"
## [11] "..Upper East"
We have our data! Now, let’s tidy it up a bit so it’s easier to handle moving on. We will rename three columns that we will need to use to make our plots. We will use the rename function from tidyverse/dplyr and change:
The DHS data has an “extra” region which is a composition of the Northen, upper West and upper East regions, we will remove this region called “Northern, Upper West, Upper East” by selecting (filtering) all Regions that are not named like that,
#tidy some data
pr_gha_tidy = pr_gha %>% dplyr::rename(Region = CharacteristicLabel
, Prevalence = Value
, Year = SurveyYearLabel) %>%
filter(Region != "Northern, Upper West, Upper East")Now let’s plot the time series of prevalence versus year, using tidyverse/ggplot. We will plot the dataframe pr_gh_tidy and will plot a line and point plot with a shaded area denoting the confidence interval for the prevalence value.
ggplot(pr_gha_tidy, aes(x= as.numeric(Year), y = Prevalence, ymin = CILow, ymax = CIHigh))+
# shaded area for the confidence interval
geom_ribbon(alpha =.3, fill = "aquamarine4")+
# line plot por mean prevalence
geom_line(color = "aquamarine4")+
# point plot for mean prevalence
geom_point(size = 2, color = "aquamarine4")+
# create facets for each region
facet_wrap(~Region, ncol = 5)+
# choose the minimal theme
theme_minimal(base_size = 18)+
# write our labels
labs(x = element_blank()
, y = "Prevalence (%)"
, caption = "Source: DHS")+
#specify breaks in scale
scale_x_continuous(breaks = c(2014,2016,2019))+
theme(panel.spacing = unit(2, "lines"))Can you make a similar plot to this for the indicator: “ML_NETC_C_ANY”?
What does this indicator relates to?
Careful! does this indicator have Confidence intervals? If not, then don’t plot them.
Result:
First we need to get the shapefile for the setting that we want to analyze. In this case I took the first administrative region for Ghana from this website https://data.humdata.org/dataset/ghana-administrative-boundaries.
But, what is a shapefile?
We look in Wikipedia…
The shapefile format is a geospatial vector data format for geographic information system (GIS) software. Mandatory files
.shp— shape format; the feature geometry itself {content-type: x-gis/x-shapefile}.shx— shape index format; a positional index of the feature geometry to allow seeking forwards and backwards quickly {content-type: x-gis/x-shapefile}.dbf— attribute format; columnar attributes for each shape, in dBase IV format {content-type: application/octet-stream OR text/plain}
That’s why we have so many other files in our GHA_shapefile folder!
Load the shapefile using rgdal
Admin1 <- file.path("GHA_shapefile"
, "GHA_admbndp1_1m_GAUL.shp") %>% readOGR(verbose = FALSE)
#what does this look like?
print(Admin1)## class : SpatialPolygonsDataFrame
## features : 10
## extent : -3.25542, 1.191781, 4.736723, 11.1733 (xmin, xmax, ymin, ymax)
## crs : +proj=longlat +datum=WGS84 +no_defs
## variables : 8
## names : CNTRY_NAME, CNTRY_CODE, ADM1_NAME, ADM1_CODE, Rowcacode1, HRname, HRpcode, HRparent
## min values : Ghana, 94, Ashanti, 941324, GHA024, Ashanti, GH024, GH
## max values : Ghana, 94, Western, 941333, GHA033, Western, GH033, GH
## [1] "Greater Accra" "Central" "Western" "Eastern"
## [5] "Ashanti" "Volta" "Brong Ahafo" "Northern"
## [9] "Upper West" "Upper East"
## [1] 10
To make our maps we are gonna use the library tmap, this library has a syntax similar to the one of ggplot and allows us to plot maps and spatial objects in a similar fashion as we would do with dataframes in ggplot.
First we need to say what is the data that we are gonna plot, in this case Amdin1 to tm_shape Second we decide what we want to plot, in this case we want to plot the 10 districts in Ghana, so we want to plot polygons. For that we call tm_polygons and choose the column name that we want to have in the legend describing each polygon, “ADM1_NAME”
We have a map! But it’s a bit messy. The legend is on top of our map, the color palette could be improved and maybe we would like to add some references and the names of the region. Let’s look at two options: keeping the region name in the legend or writing the region name on top of the region.
We will use the functions:
Go on and check the documentation for these functions to see what they do. (Typing ?tm_layout in your Rstudio console will point you to the documentation of that function)
#tmap_mode("plot")
tm_shape(Admin1)+
tm_polygons("ADM1_NAME"
, palette ="viridis"
, border.col = "white"
, title = "Region")+
tm_compass(type= "rose"
, position=c("right", "bottom")
, size = 4)+
tm_scale_bar(position = c("right", "bottom")
)+
tm_layout (frame = F
, asp = 0
, legend.outside.position = T
, legend.text.size = 1.2
, legend.title.size = 1.4)# or showing the names inside the regions
tm_shape(Admin1)+
tm_polygons("ADM1_NAME"
, palette ="viridis"
, border.col = "white"
, legend.show = F)+
tm_text("ADM1_NAME"
, scale = 1.1)+
tm_compass(type= "rose"
, position=c("right", "bottom")
, size = 4)+
tm_scale_bar(position = c("right", "bottom")
)+
tm_layout (frame = F
, asp = 0)Play around with the options in tm_layout, what changes?
Try using other color palettes: “plasma”, “RdBu”, “Blues”, “magma” …
What about other types of compass: “arrow”, “4star”, “8star”, “radar”, “rose”
What else would you like to add to this plot?
Now we can finally plot our prevalence map! For that we need to merge the Ghana shapefile with our DHS dataframe. We will use the merge function from the raster package.
We are gonna merge these two objects by the Regions names. That means that, for example, for the polygon in our shapefile corresponding to the region Ashanti, we are gonna want to join it to the correspondent malaria prevalence for Ashanti for each survey year.
Then, our first step is to check how are the Regions names spelled in each object and modify it so that they match. See how we use the code pr_gha_tidy$Region[which(pr_gha_tidy$Region== "OLD REGION NAME")] = "NEW REGION NAME" to look for all occurrences of the old region name and replace by the new name.
## [1] "Greater Accra" "Central" "Western" "Eastern"
## [5] "Ashanti" "Volta" "Brong Ahafo" "Northern"
## [9] "Upper West" "Upper East"
## [1] "Western" "Central" "Greater Accra" "Volta"
## [5] "Eastern" "Ashanti" "Brong-Ahafo" "..Northern"
## [9] "..Upper West" "..Upper East"
#We need to modify some names so that they can match in the dhs file and in the shapefile
#easier to modify on the dhs data
pr_gha_tidy$Region[which(pr_gha_tidy$Region=="..Upper East")] = "Upper East"
pr_gha_tidy$Region[which(pr_gha_tidy$Region=="..Northern")] = "Northern"
pr_gha_tidy$Region[which(pr_gha_tidy$Region=="..Upper West")] = "Upper West"
pr_gha_tidy$Region[which(pr_gha_tidy$Region=="Brong-Ahafo")] = "Brong Ahafo"
unique(pr_gha_tidy$Region)## [1] "Western" "Central" "Greater Accra" "Volta"
## [5] "Eastern" "Ashanti" "Brong Ahafo" "Northern"
## [9] "Upper West" "Upper East"
Great! Now all the regions in the DHS database look like the ones we have in our Ghana shapefile.
We need to specify by which columns to merge our two objects. For this we use the by.x and by.y options and specify the column name for each object.
Interesting: The DHS dataset has 3 entries for each region. That is, for example, for Ashanti we have prevalence values for the years 2014, 2016 and 2019. Then we need to add the option duplicateGeoms = TRUE. This options allows to duplicate the polygons so that they can be linked to different prevalence values. If we don’t do this, we are gonna an error. (Try it!)
pr_gha_shp = raster::merge(Admin1, pr_gha_tidy, by.x = "ADM1_NAME", by.y = "Region", duplicateGeoms=TRUE)Almost there! Now, let’s plot! All we need to do is plot things as we had done for the first map we did.
Interesting things to note:
my_colors <- c(RColorBrewer::brewer.pal(8, "RdBu")[1:3],RColorBrewer::brewer.pal(8, "RdBu")[5:8])
my_colors <- rev(my_colors)
tm_shape(pr_gha_shp)+
tm_polygons("Prevalence"
, palette = my_colors
, border.col = "white")+
tm_facets("Year")+
tm_layout (frame = F
, asp = 0
, panel.label.bg.color = "white"
, panel.label.size = 1.3
, frame.lwd = NA
, legend.text.size = 1.
, legend.title.size = 1.4
, legend.outside.position = "right"
, legend.outside.size = .2)Can you make a faceted map for the other indicator that we looked in Exc. 1, “ML_NETC_C_ANY”?
Result:
Now we will do an interactive plot. This kind of plots are great to incorporate to dashboards or apps build in Shiny. Here we are gonna do a simple version of the prevalence map in Ghana that will take literally one second!
All we need to do is give a name to our map, in this case g , add the option as.layers=TRUE in tm_facets and then add the line tmap_leaflet(g) at the end of the script. That’s it! We have a nice interactive plot! Go and play with the layers.